by Aishwarya Upadhyay towards Udacity nanodegree submission
Here we explore the dataset for its attributes, fields and contents. How many columns and rows are there in the dataframe we are going to process?
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
This tidy dataset contains 1,599 red wines with 13 variables (one of which, the index column X, we are going to drop) on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
## [1] 1599 13
We will conduct an Exploratory Data Analysis in order to develop intuition about this dataset, extract insights that may uncover relevant questions, and eventually prepare the development of predictive models.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Since the X variable is only used for indexing, we don’t need it and will drop the column.
Now let’s describe the contents of our data frame after dropping the X column.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
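As a minimal sketch of the column drop described above, the snippet below rebuilds a few columns from the first ten rows shown in the `str()` output and removes `X`. The name `rw_head` is an assumption made for this illustration; the full data frame appears as `rw` in later output, and the exact call used in the original analysis is not shown in this report.

```r
# First ten rows of a few columns, copied from the str() output above.
# `rw_head` is a hypothetical name used only for this illustration.
rw_head <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# X only repeats the row index, so drop it.
rw_head$X <- NULL

names(rw_head)  # remaining column names
dim(rw_head)    # number of rows, then number of columns
str(rw_head)    # compact description of each column
```

Assigning `NULL` to a column is one of several equivalent ways to remove it in base R; `subset(rw_head, select = -X)` would work as well.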
Let’s begin our exploration with a univariate analysis to identify variables that have little or no impact on wine quality, focusing on the variation of each variable.
To begin, we’re going to look at the distribution of each of our variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Log-transformed, fixed acidity follows an approximately normal distribution. Most values range from 7.10 to about 12 g / dm^3, with only a few outliers. Fixed acidity ranges from 4.60 to 15.90 g / dm^3, with a mean of 8.32 and a median of 7.90.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity ranges from 0.12 to 1.58 g / dm^3, with a mean of 0.5278 and a median of 0.52.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid values range from 0.00 to 1.00 g / dm^3, with a mean of 0.271 and a median of 0.260. Although we can’t call the distribution right skewed, because counts remain relatively even from one value to the next, low-citric wines are more numerous than high-citric wines. Values at 1 g / dm^3 are outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
For residual sugar we have a right skewed distribution with a few outliers above 10 g / dm^3. Values range from 0.9 to 15.5 g / dm^3. The mean is 2.539 and the median 2.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides follow a right-skewed distribution as well. They range from 0.012 to 0.611 g / dm^3, with three clusters. Because the range is extremely small, the impact of this variable might be negligible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Most wines have low free SO2: the higher the free SO2 level, the lower the count. We have an outlier around 68 mg / dm^3. Values range from 1 to 72 mg / dm^3, with a mean of 15.87 and a median of 14.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Wines with a low total SO2 level are more numerous, and the higher the level, the fewer wines there are in our sample. Values range from 6 to 289 mg / dm^3, with a mean of 46.47 and a median of 38.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density follows an approximately normal distribution, ranging from 0.9901 to 1.0037 g / cm^3, with a mean of 0.9967 and a median of 0.9968. It is distributed over a very small range, so this variable might be negligible too.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH follows a normal distribution ranging from 2.74 to 4.01, with a few outliers around 2.75 and above 3.75. The mean is 3.311 and the median 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates follow an approximately normal distribution as well, a little skewed to the right. We have outliers around 1.6 and 1.8 g / dm^3. The values range from 0.33 to 2 g / dm^3, with a mean of 0.6581 and a median of 0.62.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol distribution is right skewed, ranging from 8.40% to 14.90% alcohol by volume. We have outliers below 9 and above 14. The mean is 10.42 and the median 10.20.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
In theory, grades can range from 0 to 10. Effectively, they range from 3 to 8, with a median at 6 and a mean at 5.636.
Quality follows an approximately normal distribution. As such, we have little data on very low and very high grades, and must be cautious when drawing conclusions about them.
The main feature of interest in this dataset is the quality variable, which is supposedly affected by all the other variables.
Most of our variables were highly right skewed, so we had to use logarithmic transformations wherever appropriate.
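To illustrate why a logarithmic transformation helps with right-skewed variables, here is a minimal sketch using the first ten residual sugar values from the `str()` output. The helper `skew_gap` is a hypothetical name introduced for this example: it is only a rough, standardized mean-minus-median indicator, not the measure used in the original analysis, and ten values cannot reproduce the full-sample statistics.

```r
# First ten residual sugar values from the str() output (g / dm^3).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: in a right-skewed sample the mean is pulled
# above the median, so a positive value hints at right skew.
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)

# A log10 transform compresses the long right tail.
log_sugar <- log10(residual_sugar)

summary(residual_sugar)
summary(log_sugar)

skew_gap(residual_sugar)  # indicator on the raw scale
skew_gap(log_sugar)       # indicator after the transform

# Bin counts without drawing; drop plot = FALSE to see the histogram.
hist(log_sugar, plot = FALSE)$counts
```

On these ten values the indicator shrinks after the transform but stays positive, since a single high value (6.1) still dominates such a small sample.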
Density and chlorides are distributed over very small ranges. No matter the expertise of the three oenologists who graded the wines, it would be very hard to distinguish variations over such a small range. Therefore, it is likely that these variables had a negligible impact on the final quality values, which we will check in the bivariate analysis section.
Fixed acidity and alcohol, on the other hand, may very well carry important weight in the final grade, which would be an interesting insight to explore too.
Finally, because the quality histogram follows an approximately Gaussian distribution, we should be cautious in our analyses and conclusions about low and high quality wines.
Let’s explore the correlation between fixed acidity and density.
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
The correlation of 0.668 shows that density and fixed acidity are strongly positively correlated. The graph confirms it: as fixed acidity increases, density tends to increase as well.
Now, let’s explore the correlation between fixed acidity and pH. We expect pH to decrease as acidity increases, and vice versa.
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
The correlation of -0.683 shows that pH and fixed acidity are strongly negatively correlated: as fixed acidity increases, pH decreases.
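The Pearson test output above has the shape produced by base R's `cor.test()`. As a hedged sketch, the snippet below runs the same kind of test on the first ten fixed acidity and pH values from the `str()` output. The vector names are assumptions for this illustration, and ten rows will not reproduce the full-sample estimate of -0.683.

```r
# First ten values of each variable, copied from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson's product-moment correlation test (two-sided by default).
res <- cor.test(fixed_acidity, ph, method = "pearson")

res$estimate  # sample correlation coefficient
res$p.value   # p-value for H0: true correlation is 0
res$conf.int  # 95 percent confidence interval
```

With so few observations the confidence interval is wide, which is one reason the report relies on all 1599 wines.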
Let’s zoom in for clearer insights. In the plot above we have removed a few outliers and zoomed in to see the distribution of values.
Now, let’s explore the correlation between citric acid and fixed acidity.
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: rw$quality and rw$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There is a moderately strong correlation of 0.476 between alcohol and quality. Here, we can see that the median alcohol content rises with the quality grade. This would indicate that alcohol has an important impact on quality.
##
## Pearson's product-moment correlation
##
## data: rw$quality and rw$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
There is a relatively low correlation of 0.25 between quality and sulphates.
##
## Pearson's product-moment correlation
##
## data: rw$quality and rw$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
There is an inverse correlation of -0.129 between quality and chlorides. Although the correlation is weak, it suggests that chlorides are associated with lower wine quality.
We can see here that quality seems to decrease as the chloride level rises.
Let’s zoom in a bit.
##
## Pearson's product-moment correlation
##
## data: rw$quality and rw$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
It is interesting to see that such small quantities of chlorides are associated with a difference in quality, even though the correlation is weak. Chlorides are
Here is a correlation matrix, which shows the relationship of each variable with every other variable.
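A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is an illustration under stated assumptions: it uses only five columns and the first ten rows from the `str()` output, in a hypothetical data frame named `rw_head`, so its values differ from those of the full dataset.

```r
# Five columns, first ten rows, copied from the str() output.
# `rw_head` is a hypothetical name used only for this illustration.
rw_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(rw_head), 2)
cor_matrix

# Correlations of every variable with quality, strongest positive first.
sort(cor_matrix[, "quality"], decreasing = TRUE)
```

The matrix is symmetric with ones on the diagonal, so only one triangle needs to be read.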
So, from the observations above, we can understand that quality and alcohol are moderately correlated (0.476), the strongest relationship with quality among the pairs we tested. We have also seen that:
- Fixed acidity and density are strongly positively correlated.
- Fixed acidity and citric acid are also strongly positively correlated.
- Fixed acidity and pH are strongly negatively correlated.
- Quality appears to depend on alcohol, sulphates, citric acid and fixed acidity.
- Free sulfur dioxide and total sulfur dioxide are positively correlated.

### Quality by Alcohol and residual sugar
##
## Pearson's product-moment correlation
##
## data: rw$residual.sugar and rw$alcohol
## t = 1.6829, df = 1597, p-value = 0.09258
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.006960058 0.090909069
## sample estimates:
## cor
## 0.04207544
The correlation of 0.042 is very weak and not statistically significant at the 5% level (p = 0.093), and the plot above likewise shows no clear relationship between residual sugar and alcohol.
Most of our data has a volatile acidity between 0.4 and a little above 0.6.
##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
We can identify a cluster here, loosely in the 11 to 13 range for alcohol degree and between 0.2 and 0.4 for volatile acidity, where dots tend to be high-quality blue. The higher the volatile acidity, the hotter the color. At low alcohol levels, the white-blue combination (light blue) dominates.
##
## Pearson's product-moment correlation
##
## data: rw$total.sulfur.dioxide and rw$free.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
Insights: Here we can see how closely free SO2 and total SO2 are correlated, and how quality varies with them. The correlation coefficient of 0.668 indicates a strong positive correlation, which is also clearly visible in the plot.
This dataset contains 1599 entries and 12 variables on the chemical properties of the wine. The univariate analysis enabled us to understand the distribution of each variable.
The quality variable is discrete and the others are continuous, so we use the discrete quality variable to plot the number of wines at each quality grade.
## Observations of plot 1

The distribution of red wine quality appears to be approximately normal. 82.5% of wines are rated 5 or 6 (average quality). Although the rating scale runs from 0 to 10, no wine is rated 0, 1, 2, 9 or 10.
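Grade counts and shares of this kind can be tabulated with `table()` and `prop.table()`. The sketch below uses only the first ten quality grades from the `str()` output, stored in a hypothetical vector `quality_head`, so it shows the method rather than the full-sample figure of 82.5%.

```r
# First ten quality grades, copied from the str() output.
quality_head <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count every grade on the theoretical 0 to 10 scale, including unused ones.
grade_counts <- table(factor(quality_head, levels = 0:10))
grade_share  <- prop.table(grade_counts)

# Share of wines graded 5 or 6 in this small sample.
share_5_6 <- sum(grade_share[c("5", "6")])
share_5_6

# Grades that never occur in the sample.
names(grade_counts)[grade_counts == 0]
```

Fixing the factor levels to `0:10` keeps empty grades in the table, which makes gaps at the ends of the scale explicit.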
In this plot, we can observe that even small quantities of chlorides are associated with differences in quality. There are outliers, which we trimmed out, but we can still observe that chlorides are present in very small quantities across the different wines and yet appear to have an effect.
##
## Pearson's product-moment correlation
##
## data: rw$total.sulfur.dioxide and rw$free.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
This data set contains 1599 red wines, each described by 12 variables. When I first looked at the red wine dataset, I felt I had reached a road with 13 ways to go, and I couldn’t see where to start. So I began by exploring single variables and how they are distributed.
In the univariate analysis of the data we explored all the dataset variables. Except for quality, all the variables are continuous.
After exploring the dataset, I could see that a few variables are strongly related to each other and others are not. Overall, the high variation observed within the variables led me to more interesting questions, most of which revolve around the quality of the wine.
So, I started focusing on quality, as it was the only variable with discrete values, and I used it to explore the other variables.
There I found very interesting observations about how quality is affected by different variables, and then, using the correlation coefficients, I plotted a few graphs to check the correlation values.
- In the data set, about 80% of the red wines score 5 or 6, with few very low or very high scores, so most of the wines are of average quality.
- The most interesting plots are shared in the Final Plots section: “The Variation of Quality with Free SO2 and Total SO2” and “Distribution of Number of wines with Quality”.
Future work could use a dataset in which, apart from the wine quality, each wine is also ranked by 5 different wine tasters, since opinions change with many factors once the human element is included. Including that human element in my analysis would add perspective and could reveal unseen factors that lead to a better or worse wine quality. Having these factors in the dataset would give the analysis a different insight altogether.